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The important task of determining the connectivity of gene networks, and at a more detailed 
level even the kind of interaction existing between genes, can nowadays be tackled by microarraylike 
technologies. Yet, there is still a large amount of unknowns with respect to the amount of data 
provided by a single microarray experiment, and therefore reliable gene network retrieval procedures 
must integrate all of the available biological knowledge, even if coming from different sources and 
of different nature. In this paper we present a reverse engineering algorithm able to reveal the 
underlying gene network by using time-series dataset on gene expressions considering the system 
response to different perturbations. The approach is able to determine the sparsity of the gene 
network, and to take into account possible a priori biological knowledge on it. The validity of 
the reverse engineering approach is highlighted through the deduction of the topology of several 
simulated gene networks, where we also discuss how the performance of the algorithm improves 
enlarging the amount of data or if any a priori knowledge is considered. We also apply the algorithm 
to experimental data on a nine gene network in Escherichia coli. 



I. INTRODUCTION 



The amount and the timing of appearance of the transcriptional product of a gene is mostly determined by reg- 
ulatory proteins through biochemical reactio ns that enhance or block polymerase binding at the promoter region 
( Jacob and Monod 19611 iDickson et al. 1975^ . Considering that many genes code for regulatory proteins that can 



activate or repress other genes, the emerging picture is conveniently summarized as complex network where the genes 
are the nodes, and a link between two genes is present if they interact. The identification of these networks is becoming 
one of the most relevant task of new large-scale genomic technologies such as DNA microarrays, since gene networks 
can provide a detailed understanding of the cell regulatory system, can help unveiling the function of previously 
unknown genes and developing pharmaceutical compounds. 



Different approaches have been prop osed to describe gene networks ( s ee (iFilkov 2005) for a review), and dif- 
ferent procedures have been propos ed ( Tone et al. 200 j . iLee et al. 2002 . Ildeker et al. "20011 . iDavidson et al. 20021 



lArkin et al. 19971 lYeung et al. 200^ to determine the network from experimental data. This is a computationally 
daunting t ask, which we address in the present work. Here we describe the network via deterministic evolution 
equations ( Teener et al. 20031 iBansal et al. looi), which encode both the strenght and the direction of interaction 



between two genes, and we discuss a novel reverse engineering procedure to extract the network from experimental 
data. This procedure, though remaining a quantitative one, realizes one of the most important goal of modern system 
biology, which is the integration of data of different type and of knowledge obtained by different means. 

We assume that the rate of synthesis of a transcript is determined by the concentrations of every transcript in a 
cell and by external perturbations. The level of gene transcripts is therefore seen to form a dy namical system whic h 
in the most simple scenario is described by the following set of ordinary differential equations ( de Jong et al. 2002f) : 



X{t)^ AX{t) + BU{t) (1) 

where X[t) = [xi{t), . . . ,xn (t)) is a vector encoding the expression level of Ng genes at times t, and U a vector 
encoding the strength of Np external perturbations (for instance, every element Uk could measure the density of a 
specific substance administered to the system). In this scenario the gene regulatory network is the matrix A (of 
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dimension Ng X Ng), as the element Aij measures the influence of gene j on gene i, with a positive Aij indicating 
activation, a negative one indicating repression, and a zero indicating no interaction. 

The matrix B (of dimension Ng x Np) encodes the coupling of the gene network with the A^p external perturbations, 
as Bik measures the influence of the fc-th perturbation on the i-th gene. 

A critical step in our construction is the choice of a linear differential system. Even if a such kind of model is based 
on particular assumptions on the complex dynamics of a gene network, it seem the only practical approach due to the 
lack of knowledge of real interaction mechanism between thousands of genes. Even a simple nonlinear approach would 
give rise to an intractable amount of free parameters. However, it must also be recognized that all other approaches 
or models have weakness points. For ins tance, boolean mod els (which have been very recently applied to inference 
of networks from time series data, as in ( Martin et al. 200?! ). strongly discretize the data and select, via the use of 
an arbitrary threshold, among active and inactive gene at every time-step. Dynamical Bayesian models, instead, are 
more data demanding than linear models due to their probabilisti c nature. Mor eover, their space complexity grows 
like Ng (at least in the famous Reveal Algorithm by K.P. Murphy ( Murphv 200l[ )l. which makes this tool suitable for 
small networks. 

The linear model of Eq. ([1]) is suitable to describe the response of a system to small external perturbations. It can 
be recovered by expanding to first order, and around the equilibrium condition X{t) = 0, the dependency of X on X 
and U, X{t) ~ f{X{t), U). Stability considerations {X{t) must not diverge in time) require the eigenvalues of A to 
have a negative real part. Moreover it clarifies that if the perturbation U is kept constant the model is not suitable to 
describe periodic systems, like cell cycles for example, since in this case X{t) asymptotically approaches a constant. 

Unfortunately data from a given cell type involve thousands of responsive genes Ng. This means that there are 
many different regulatory networks activated at the same time by the perturbations, and the number of measurements 
(microarray hybridizations) in typical experiments is much smaller than Ng. Conseq uently, inference m ethods can 
be successful, but only if restricted to a subset of the genes (i.e. a specific network) (jBasso et al. 20051) . or to the 
dynamics of genes subsets. These subsets could be either gene clusters, created by grouping genes sharing similar 
time behavior, or the modes obtained by using singular value decomposition (SVD). In these cases it is still possible 
to use Eq. ([1]), but X(t) must be interpreted as a vector encoding the time variation of the clusters centroids, or the 
time variation of the characteristics modes obtained via SVD. 

In this paper we present a method for the determination of the matrices A and B starting from time series experi- 
ments using a Global Optimization approach to minimize an appropriate figure of merit. With respects to previous 
attempts, our algorithm as the uses explicitly the insight provided by earlier studies on gene regulatory networks 
( Jenog et al. 2000 . IJenog et al. 200i[ ). namely, that gene networks in most biological systems are sparse. In order 
to code such type o f features the problem itself must be formulated as mixed-integer nonlinear optimization one 
( Hansen et al. 19931 ). Moreover our approach is intended to explicitly incorporate prior biological knowledge as, for 
instance, it is possible to impose that: Aij < (= 0, > 0, ^ 0) if it is known that gene j inhibits (does not influence, 
activates, influences) gene i. This means that the optimization problem is subject to inequality and/or equality con- 
straints. Summing up the characteristics of the problem wc must solve: high dimensionality, mixed integer, nonlinear 
programming problem for the exact solution of w hich no method exists. An approximate solution can be found 
efficiently using a global optimization techniques ( Horst and Pardalos 19951 iPardalos and Romeiin 2(30^ based on 
an intelligent stochastic search of the admissible set. As consequence of the optimization method used, there is no 
difficulties to integrates different time series data investigating the response of the same set of genes to different per- 
turbations, even if different time series are sampled at different (and not equally spaced) time points. The integration 
of different time series is a major achievement, as it allows for the joint use of data obtained by different research 
groups. We believe that the integration of multip le time-series d ataset in unveiling a gene network is a topic of great 
interest as focused in recently published papers ( Shi et al. 200l1 ). 

We illustrate and test the validity of our algorithm on computer simulated gene expression data, and we apply it 
to an experimental gene expression data set obtained by perturbing the SOS system in bacteria E. coli. 



II. METHODS 



The simplest assumption regarding the dynamical response of gene transcripts (intially in a steady state, X{t) = 
for t < 0), to the appearance of an external perturbation U{t) at time t > is given by Eq. ([1]). Since the state of 
the system measured at discrete times i = tfc, fc = 0, . . . , iVf , it useful to consider the discrete form of Eq. [T] 

Xitk+i)^AX{tk) + Uih,h+i), (2) 

where A is a matrix with dimension Ng x Ng, and [/ is a function of the perturbations, namely 
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A = cxp{AAt), 



U{tk,tk+i) 



exp{A{tk+i-T)}BU{T) dr. 



(3) 



Here we have assumed, for simplicity sake, tk = kA, but the generahzation to the most general case is straightforward. 
In particular, for constant U one gets B = U{tk,tk+i) = (cxp{^At} — 1)A~^BU. 

Due to the presence of noise the measured X{tk) do not coincide with the true values X{tk) expected to satisfy 
Eq. ([1]). If we for simplicity observed samples affected by independent, zero mean additive noise Sk, namely X(tk) ~ 
X{tk) + Sk, the matrices ruling the dynamics of Eq. ([1]) can be found by requiring the minimization of a suitably 
defined cost function. 

Under the simplifying assumption of a constant external perturbatio n, previous works have been focused on the 
determination of A and B (from which A and B can be retrieved), as in ([Bansal et al. 2006nIIolter et al. 2001i) . The 
matrices A and B have been assumed to be those minimizing the cost function 



CF{A,B)= J2 \Xitk+i)-iAXitk)+BU))\ 



(4) 



fe=0 



Eq. [T] can be written as standard linear least squares estimation problem for A, B, whose solution can be found 
by computing the pseudoinverse of a suitable matrix, providing that number of observations is sufficiently high: 
Nt>Ng + Np. 

In the present analysis we introduce a new reverse engineering approach to determine the matrices A and Z5, which 
turns out to be more efficient and flexible than previous ones. Our approach is based on the following considerations, 
which have not been taken into account in previous works: 

a. Each gene expression time-series could in principle be scanned according to both time versus, namely the 
time-reversibility of dynamics. 



b. There is a biological evidence suggesting that the matrix A is sparse ( Jenog et al. 2000l|Jenog et al. 200 ll ). For 
this reason any reverse engineering algorithm has to be able to capture the proper sparsity of gene regulatory 
network. 



c. In many situations there are biological prior information about bounds on numerical value of some specific 
entries of A and B. Such bounds must be taken into account in the solution procedure. 



As a consequence of this we use as a cost function the reduced Xrcd defined as follows: 



where 



Xrcd 



CT{A, B) 



(5) 



Nt-l r 



fe=0 



CT{AB) = \x{tk+i)-(AX{tk)~U{tk,tk+i))\ +\x{tk)-(A~^X{tk+i) + U{tk+i,tk)) 



(6) 



Note that A~^J7(tfe, tfe+i) = —Uitk+i^tk), and the quantities A and U{tk,tk+i) can be obtained from A and B by 
appropriate numerical approximation algorithms for Eq.s ([3]). The quantity a denotes the standard deviation of the 
independent; additive noise affecting the dataset. 

A straightforward optimization on dynamics/input matrices of Eq. Ilj is the main improvement of the proposed 
approach with respect to the previous ones. This is the only way that enable us to incorporate the sparscness require- 
ment on A, B and eventually available biological priors. It is clear that sparscness was destroyed by exponentiation 
and integration involved in the continuous-discrete transformation of the problem, in the same way simple bounds 
on A, B elements are transformed in highly complex nonlinear relations on A, B. The price paid for the flexibility of 
the approach is the computational effort required for any computation of the error function. This put a very strong 
attention to the efficiency of the optimization algorithm. 

In Eq. ^ the two contributions in square brackets account for the forward and backward propagation, respectively, 
and thus implement the time reversibility of the dynamics. Moreover, the sparsity of the gene network is taken into 
account via the number of degrees of freedom (d.o.f.) defined as njof = J^par — neq with ripar = Ng(Ng + Np) — Rzcro 
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the number of free parameters, rioq = Ng(Nt — 1) the number of equations (constraints) and n^c 
elements of A and B (a total of n^cm) fixed to zero. 

The generalization of the algorithm to the case in which there are different time-series, X"{tk), corresponding to the 
response of the same set of genes to similar and/or different perturbations B" with q = 1, Np is straightforward. 
In this case the cost function to be minimized is simply 



2 _ 1 ^CT{A,B'') 



Xrcd 



2ndof 



Z — ^ rr^ 



Here we have assumed the noise to depend on the time-series {a). It is clearly possible, however, to introduce a time 
(ifc) and even a gene (i) dependence, i.e to use a = cr^(ifc). 

We detail now our procedure to find the spare matrices A and B minimizing x^cd" which is in general a formidable 
task. The first difficulty is the determination of the number npar of not vanishing elements of A^ B (or equivalently 
the number of d.o.f. ndof )• Having determined ripar the problem is still very complicated since there are 



^par ^par ) • 

different ways of choosing these ripar elements out of the Ng{Ng + Np) candidates. For typical values of the parameters, 
for instance Ng = 10 and ripar = \/2Ng — 50, the number of possible combinations is of the order of 10'^^, so big 
that any kind of extensive algorithmic procedure is precluded. A practical approach to, at least approximately 
solve, this formidable problem is that of resort to a global optimization techniques based on a stochastic strategy to 
search of the admissible se t, for a comprehensive review os such type of methods one can see ( Horst and Pardalos 19951 



iHorst and Pardalos 1995t ). We h ave tackled this problem v ia the implementation of the more classical of such methods 
a simulated annealing procedure ( Kirkpatrick et al. 19831 ). based on a Monte Carlo dynamics. For each possible value 



of the number of parameters ripar, the algorithm search for the matrices A and B with a total of ripar non zero elements 
minimizing the cost function of Eq. ([S]), as discussed below. We then easily determine Xrad('^par) ^-^d the minimizing 
matrices ,4* and B* which are our best estimates of the true matrices. In order to determine the matrices A and B 
with a total of ripar non zero parameters which minimize the cost function, our simulated annealing procedure starts 
with two random matrices A and B with a total of rtpar not vanishing parameters, and changes the elements of these 
matrices according to two possible Monte Carlo moves. One move is the variation of the value of a not vanishing 
element of the two matrices, the other one consists in setting to zero a previously non-zero element, and to a random 
value a zero element. Each move, which involves a variation NZT of the cost function, is accepted with a probability 
exp[— ACF/T], where T is an external parameter. As in standard optimization by annealing procedures, we start 
from a high value of T, of the order of the cost function value, and then we slowly consider the limit T ^ 0. In the 
limit of infinitesimally small decrease of T the algorithm is able to retrieve the true minimum of the cost function, 
while for faster cooling rates estimates of the real minimum are recovered. 

As the Monte Carlo moves attempt to change the values of the elements of A and Z5, it is easy to introduce 
biological constraints on the values of Aij and of -B^-, as we will shown in a following example. The algorithm requires 
the evaluation of the cost function CT ^ which is a time consuming operation as the computation of the discrete matrix 
A and of its inverse A~^ are required. We have implemented this algorithm in C-I-+ making use of the GNU Scientific 
Library, www.gsl.org. 

III. RESULTS 

In this section, we illustrate our reverse engineering algorithm with three examples. The validity of our algorithm 
and of other known ones are evaluated by comparing the exact dynamical matrices A and B with their best estimate 
A^ and B* obtained via the reverse engineering procedure. To this end, we have introduced the parameter 

where C* = A^ or B* and |C| is the Li norm of the matrix C. Clearly, r\c ^ Oj tfic equality being satisfied if and only 
if C = C* . Since r\c is a measure of a relative error it has no upper bound, but the estimate of C becomes unreliable 
when r\c is above 1, i.e. |C — C*| > |C|. This parameter allows for a faithful evaluation of the quality of the reverse 
engineering approach, as it summarizes the comparisons of all retrieved elements Cy with their true values C*y 

We discuss three applications. First, we show how our algorithm works when applied to a single time series. In this 
case one can show that the cost function Xrcd(-^' -^)' which takes into account both the forward and the backward 
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FIG. 1: Synthetic time-series X{tk) with Ng — 8 elements measured at Nt — 20 equally spaced time-points. 



propagation, is more effective in determining the structure of the gene network than the usual cost function CF{A, B) 
of Eq. (j4]), which only considers the forward propagation. The second example shows how we can easily take into 
account the presence of different time-series, while the last example shows how biological priors can be included. 
Before discussing the examples we shortly describe the procedure used to generate the synthetic dataset. 



1. Generation of a synthetic dataset 

In order to generate a synthetic dataset X{tk) one must construct the matrices A and Z5, from which it is possible 
to generate the noiseless time-series X{tk). Hence, one gets X{k) = X{tk) + £k for k ~ 1, ..,iVf where £k are i.i.d. 
random variables with standard deviation <t. 

While there are no constraints on B, A must be a sparse random matrix whose complex eigenvalues have negative 
real part. The generation of A proceeds according the following steps. First, we generate a Ng x Ng block diagonal 
matrix A^^^ , whose Ng blocks are 2x2 antisymmetric matrices with diagonal elements A" and off diagonal elements 
A^a, or 1 X 1 negative real elements A. By direct constructions all of the Ng eigenvalues of the matrix have 
negative real part. Then we generate a series Rk of random unitary matrices, with only 4 off-diagonal not vanishing 
entries, and compute the matrices A'^ ~ RkA^'^^^^ R^"^ , all of them sharing the spectrum of A'"'. Clearly, as k grows, 
the number of vanishing entries (the sparsity) of A''^' decreases. We fix A as the matrix ^^'^^ characterized by the 
desired number of vanishing elements. By choosing typical values of A" and Af it is possible to control the time scale 
of the relaxation process of the system following the application of the perturbation. 



2. Example 1: a single time series 

Let us consider a simulated time-series X{tk) = (xi{tk), . . . ,xn {ik)) with Ng = 8 measured at Nt — 20 equally- 
spaced time-points, as shown in Fig.[TJ This dataset is generated by starting from a sparse gene network A (with only 
49 out of Ng = 64 non-zero elements), a constant perturbation U{t) ~ 1 and a sparse external perturbation-coupling 
matrix B with a single not vanishing entry. The white noise is characterized by a standard deviation 

i=i k=i 3 ^ 

measured in units of the mean absolute value of the expression levels of all genes. In particular the value p ~ 0.05 
has been used. 

We have applied our algorithm to this dataset. To this end, we have minimized the reduced chi-square Xrcd' defined 
in Eq. (O, for different values of the number of parameters ripar (i-e. of the number of degrees of freedom ridof)- Fig- HI 
shows that xfed ^ non-monotonic dependence on the number of parameters ripar- This feature is a signature of the 
fact that both networks with few or with many connections are bad descriptions of the actual gene regulatory system. 
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FIG. 2: The main panel (inset) show the dependence of x?od (of minimum of the cost function) on the number of not 
vanishing parameters ripar, as determined by our algorithm when applied to the time-series shown in Fig. [1] The fluctuations 



are due to the probabilistic nature of the Monte Carlo minimization procedure. 

39 parameters. 



The quantity Xrod varies non-monotonically 
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FIG. 3: We plot here the values of the element of the estimated matrices A*j, obtained both with the linear algebraic approach 
and with our algorithm, versus their true value Aij. Ideally, the points should line on the y = x dotted line. 



Accordingly, our best estimate of the number of not vanishing parameters is ri^^^ = 39, where Xrod minimum, 
and the corresponding minimizing matrices A* (with 33 non zero entries) and B* (with 6 non zero elements) are our 
best estimates of the actual gene network encoding matrix A and of the matrix B. The estimators assume the values 
r}j( = 0.76 and jyg = 0.005. These values indicate that, when applied to this small dataset, our algorithm is able to 
retrieve B to a very good approximation, and A with a comparatively larger error. 

For comparison, we have also obtained the matrices A and B which exactly minimize CF{A, B) via a linear algebraic 
approach, and retrieved the corresponding continuous matrices via the use of the bilinear transformation, obtaining 
the scores 77^ = 2.1 and 773 = 0.012. These numbers prove that by exploiting the time reversibility of the equation 
of motion, and the sparseness of the gene network, is it possible to estimate the parameters of the network with a 
greater accuracy, as also shown in Fig. [3] where wc plot the best estimates A*^ obtained by both methods versus their 
true values Aij: in the case of perfect retrieval all of the points should lie on the y = x line. 
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FIG. 4: Dependence of rjA on the fraction of priors, as obtained by analyzing one or two time-series. The scoring parameter 
■qA decreases as the number of priors increases, indicating that a better estimate of the gene network A is recovered. 



3. Example 2: multiple time series 



There are two major problems encountered when trying to infer a gene network via the analysis of time-series data. 
The first one is that there are usually to few time-points with respect to the large number of genes. The second 
one is associated to the fact that, when the system responds to an external perturbation, only the expression of the 
genes directly or indirectly linked to that perturbation changes, i.e., only a specific sub- network of the whole gene 
network is activated by the external perturbation. While through the study of the time-series it is possible to learn 
something about the regulatory role of the responding genes, nothing can be learnt about the regulatory role of the 
non-responding genes. 

These problems can be a ddressed by using gene network retrieval procedures which are able to simultaneously 
analyze different time-series (|Wang et al. 20061) . particularly if these measure the response of the system to different 
perturbations, as wc expect different perturbations to activate different genes. Our reverse engineering approach 
naturally exploits the presence of multiple time series by requiring the minimization of Eq. ([7])- 

Here wc study the network discussed in the previous example by adding to the time-series shown in Fig. [U other 
ones generated by the application of two different perturbations. For sake of simplicity all time-scries are measured at 
equally-spaced time-points, but with an elapsing time between two consecutive data points depending on the particular 
time-series. Hence that the problem cannot be reduced to the one of a single average time-series by exploiting the 
linearity of Eq. ([T]) . 

As the number of time-series increases, our determination of the gene network A becomes more and more 
accurate. For instance, while by means of a single perturbation wc obtain ry_4 = 0.76 (?7B(, = 0.005), by using 
two time-series we obtain ry^ = 0.25 (r/eo = 0.004, r/g^ — 0.003), and by using three time series we get 77,4 = 0.13 
(ilBo = 0.004, - 0.002,7/6, ^ 0.002). 



4- Example 3: biological priors 



As the traditional approach to research in Molecular Biology has been an inherently local one, examining and 
collecting data on a single gene or a few genes, there are now couples of gene which are known to interact in a 
specific way, or do not interact at all. This information is nowadays easily available by consulting pubic databases 
such as Gene Ontology. Here we show that it is possible to integrate this non-analytical information in our reverse 
engineering approach, improving the accuracy of the retrieved network. To this end we consider again the gene 
network A but we introduce some constraints on a fraction / of randomly selected elements of the matrices A and 
B, namely 10% < / < 40%. As our retrieval procedure tries to exchange vanishing and not vanishing elements of A 
and B we introduce the constraints as follows: if the clement is zero in the exact matrices then wc set it to zero and 
wc never try to set it to a non-zero value; on the contrary, if the clement is different from zero, its value is free to 
change and we never try to set it to zero. By using this approach we assure that our best estimates of A and B are 
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consistent with the previous knowledge. In order to stress the greater improvement that can be obtained via the use 
of biological priors, we consider now the same gene network A and perturbations of examples 1 and 2, but we corrupt 
the noiseless dataset by adding a noise (see Eq. (|10[1 ) characterized hy p = 0.1, and not hy p = 0.05 as before. Due 
to the high value of the noise the linear algebraic approach is not more able to recover the gene network matrix, as 
it obtains a score 77^ = 4.40. 

We show in Fig. |4] the dependence of rjA on the fraction of randomly selected elements of A and B fixed either to 
zero or to non-zero, both for the case in which only one or two perturbations have been used in the retrieval procedure. 
As expected, tja decreases as the number of priors increases, showing that as more biological knowledge on the system 
of interest is available the reliability of our reverse engineering approach improves. 



5. Results on Escherichia Coli 



We applied our algorithm to a nine gene network, part of the SOS network in E. Coli. The genes are recA, lex A, Ssb, 
recF, dini, umuDC, rpoD, rpoH, rpoS, and the used time-series consists of six time measurements (in triplicate) of 
the expression level of these genes following treat ment with Norfloxac in, a known antibiotic that acts by damaging 
the DNA. The time series is the same used in Ref. ( Bansal et al. 20061 ). and experimental details can be found there. 



Given N„ 



9 there are 90 unknowns to be determined, as ^ is a A'g x Ng matrix, and S is a vector of length Ng. 
Since Nt = 6, the experimental data allows for t he writing of No(Nt — 1) = 45 equations, and for the determination 
of only 45 unknowns, while a literature survey (jBansal et al. 20061 ) suggests that there are at list 52 connections 
between the considered genes (including the self-feedback). As in previous works, we are therefore forced to use an 
interpolation technique to add new time measurements, creating a time series with 11 time points. 

When applied to this dataset, our algorithm found that Xred minimized by a matrix A with 57 not vanishing 
entries, and a vector B with 6 non-zero elements, which are given in Table HI In the literature, there are 52 known 
connections between the nine considered genes, including the self- feedback.. We are able to find 37 of these connections. 
Regarding the interaction with Norfioxacin, our algorithm found that primary target is recA, as expected. 



recA lexA Ssb recF dinI umuDC rpoD rpoH rpoS 11 B 



recA 

lexA 

Ssb 

recF 

dinI 

umuDC 

rpoD 

rpoH 

rpoS 



■1.68 



-0.36 1.81 1.05 0.84 



-0.11 -1.56 0.59 0.58 0.40 



-0.47 1.82 -2.83 



0.68 0.42 



0.60 



1.18 0.72 0.39 -0.96 -1.71 0.42 



0.47 -0.63 -0.39 -0.64 0.19 -0.65 0.11 



-0.06 -0.28 



-0.39 -0.43 



0.36 



-1.10 1.60 -0.32 0.92 



-0.22 



-0.59 



-0.34 



0.96 -1.71 1.29 



-0.93 -0.52 -0.40 -0.30 1.13 



0.53 



-3.46 1.46 



0.18 0.92 0.26 0.82 -0.72 



0.71 



0.13 



0.38 



0.34 



0.40 



-0.11 



TABLE I: The matrix A encoding the SOS network for E. colv. each element codes the effect of the gene of the column on 
the gene of the raw. The last column shows the effect of Norfloxacin on the considered genes, B. All elements are expressed 
in 10~^s~^. The matrix A has 57 not vanishing elements, while B has 6 non-zero elements. For visualization purposes zero 
elements have been replaced by a dash 
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IV. CONCLUSIONS 

111 the framework of a linear deterministic description of the time evolution of gene expression levels, we have 
presented a reverse engineering approach for the determination of gene networks. This approach, based on the 
analysis of one or more time-series data, exploits the time-reversibility of the equation of motion of the system, the 
sparsity of the gene network and previous biological knowledge about the existence/absence of connections between 
genes. By taking into account this information the algorithm significatively improves the level of confidence in the 
determination of the gene network over previous works. 

The drawback of our procedure is the computational cost, which at the moment limits the applicability of the 
algorithm to a small number of genes/clusters. There are two time-consuming procedures. One is the transformation 
of the continuous matrix A in the discrete matrix A, which we have been avoided by using the bilinear transformation, 
but whose validity breaks down as the time interval between two consecutive measurements increases. The second 
one, which at the moment is the most expensive in time, is the computation of the inverse matrix A~^, which we 
accomplish through the so-called LU decomposition whose computational cost is 0{N^). Alternative methods for 
exploiting the reversibility of the dynamics should therefore by devised for applications with a larger number of 
genes. 
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